Abstract


The true online TD(λ) algorithm has recently been proposed (van Seijen and Sutton, 2014) as a universal replacement for the popular TD(λ) algorithm in temporal-difference learning and reinforcement learning. True online TD(λ) has better theoretical properties than conventional TD(λ), and the expectation is that it also results in faster learning. In this paper, we put this hypothesis to the test. Specifically, we compare the performance of true online TD(λ) with that of TD(λ) on challenging examples, random Markov reward processes, and a real-world myoelectric prosthetic arm. We use linear function approximation with tabular, binary, and non-binary features. We assess the algorithms along three dimensions: computational cost, learning speed, and ease of use. Our results confirm the strength of true online TD(λ): 1) for sparse feature vectors, the computational overhead with respect to TD(λ) is minimal, while for non-sparse features the computation time is at most twice that of TD(λ); 2) across all domains/representations the learning speed of true online TD(λ) is often better, but never worse, than that of TD(λ); and 3) true online TD(λ) is easier to use, because it does not require choosing between trace types, and it is generally more stable with respect to the step-size. Overall, our results suggest that true online TD(λ) should be the first choice when looking for an efficient, general-purpose TD method.


1. Introduction 


Temporal-difference (TD) learning (Sutton, 1988) is a core learning technique in modern reinforcement learning (Kaelbling et al., 1996; Sutton and Barto, 1998; Szepesvári, 2010). One of the main challenges in reinforcement learning is to make predictions, in an initially unknown environment, about the (discounted) sum of future rewards, the return, based on currently observed feature values and a certain behaviour policy. With TD learning it is possible to learn good estimates of the expected return quickly by bootstrapping from other expected-return estimates. TD(λ) (Sutton, 1988) is a popular TD algorithm that combines basic TD learning with eligibility traces to further speed learning.

The ability of TD(λ) to speed learning is explained by its forward view, which states that the estimate at each time step is moved toward an update target known as the λ-return, where the λ-parameter determines the trade-off between bias and variance of the update target. This trade-off has a large influence on the speed of learning and its optimal setting varies from domain to domain. The ability to improve this trade-off by adjusting the value of λ enables eligibility traces to improve the learning speed.

True online TD(λ) (van Seijen and Sutton, 2014) is a recently proposed variation of TD(λ) with better theoretical properties. Specifically, it maintains an exact equivalence to the forward view at all times. In contrast, TD(λ) accurately approximates the forward view only for appropriately small step-sizes. Hence, true online TD(λ) can be expected to do a better job of improving the learning speed. Initial experiments suggest that this is indeed the case (van Seijen and Sutton, 2014). However, no substantial empirical study has been performed so far. In this paper, we empirically compare true online TD(λ) with TD(λ) on a wide variety of domains.


2. Markov Reward Processes 

We focus in this paper on discrete-time Markov reward processes (MRPs), which can be described as 4-tuples of the form (S, p, r, γ), consisting of S, the set of all states; p(s'|s), the transition probability function, giving for each state s ∈ S the probability of a transition to state s' ∈ S at the next step; r(s, s'), the reward function, giving the expected reward after a transition from s to s'; and γ, the discount factor, specifying how future rewards are weighted with respect to the immediate reward. An MRP can contain terminal states, dividing the sequence of state transitions into episodes. When a terminal state is reached the current episode ends and the state is reset to the initial state. The return at time step t is the discounted sum of rewards observed after time step t:

$$G_t = \sum_{i=1}^{\infty} \gamma^{i-1} R_{t+i},$$

where $R_k$ is the reward received at time k. For an episodic MRP, the return is the discounted sum of rewards up to the time step that the terminal state is reached.
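As a concrete illustration of the return for an episodic MRP, the following Python sketch computes G_t for every time step of one finite episode by working backwards through the rewards. The function name `discounted_returns` and the convention that `rewards[t]` holds R_{t+1} are assumptions made for this illustration only.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] for one finite episode.

    rewards[t] is taken to be R_{t+1}, the reward observed after time
    step t (an indexing convention assumed for this sketch).
    """
    returns = [0.0] * len(rewards)
    g = 0.0  # the return from the terminal state onward is zero
    for t in reversed(range(len(rewards))):
        # Recursive form of the definition: G_t = R_{t+1} + gamma * G_{t+1}
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns
```

The backward recursion is equivalent to the sum in the definition and avoids recomputing the discounted tail for every time step.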

We are interested in learning the value function v of an MRP, which maps each state s ∈ S to the expected value of the return:

$$v(s) = \mathbb{E}\{G_t \mid S_t = s\}.$$

In the general case, the learner does not have access to s directly, but can only observe a feature vector $\phi(s) \in \mathbb{R}^n$. We estimate the value function using linear function approximation, in which case the value of a state is the inner product between a weight vector θ and a feature vector φ. In this case, the value of state s is approximated by:

$$\hat{v}(s, \theta) = \theta^\top \phi(s) = \sum_{i=1}^{n} \theta_i\, \phi_i(s).$$

As a shorthand, we will indicate $\phi(S_t)$, the feature vector of the state visited at time step t, by $\phi_t$.


3. Algorithms 

This section presents the two methods that we compare: conventional TD(λ) and true online TD(λ).

3.1 Conventional TD(λ)

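The conventional TD(λ) algorithm is defined by the following update equations:

$$\delta_t = R_{t+1} + \gamma\, \theta_t^\top \phi_{t+1} - \theta_t^\top \phi_t \qquad (1)$$

$$e_t = \gamma \lambda\, e_{t-1} + \phi_t \qquad (2)$$

$$\theta_{t+1} = \theta_t + \alpha\, \delta_t\, e_t \qquad (3)$$

for t ≥ 0, and with $e_{-1} = 0$. The scalar $\delta_t$ is called the TD error and the scalar α is the step-size. The vector $e_t$ is called the eligibility-trace vector, and the parameter λ ∈ [0, 1] is called the trace-decay parameter.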
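Equations (1)-(3) translate almost line for line into code. The following NumPy sketch performs a single update; the function name `accumulate_td_step` and its argument order are hypothetical, and the feature vector of a terminal state is assumed to be the zero vector.

```python
import numpy as np

def accumulate_td_step(theta, e, phi, phi_next, reward, alpha, gamma, lam):
    """One step of TD(lambda) with accumulating traces, equations (1)-(3)."""
    # (1) TD error, computed with the current weights theta_t
    delta = reward + gamma * (theta @ phi_next) - theta @ phi
    # (2) decay the previous trace and add the current feature vector
    e = gamma * lam * e + phi
    # (3) move the weights along the trace, scaled by step-size and TD error
    theta = theta + alpha * delta * e
    return theta, e
```

At the start of each episode the caller would reset `e` to a zero vector, matching the initial condition $e_{-1} = 0$.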

TD(λ) can be very sensitive to the α and λ parameters. In particular, a large value of λ combined with a large value of α can easily cause divergence, even on simple tasks with bounded rewards. For this reason, a variant of TD(λ) is often used that is more robust with respect to these parameters. This variant, which assumes binary features, uses a different trace-update equation:

$$e_t(i) = \begin{cases} \gamma \lambda\, e_{t-1}(i) & \text{if } \phi_i(S_t) = 0 \\ 1 & \text{if } \phi_i(S_t) = 1 \end{cases} \qquad \text{for all } i.$$
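The difference from the accumulating trace is easiest to see in code. The sketch below is a minimal illustration, assuming binary feature vectors stored as NumPy arrays; the function name `replacing_trace` is hypothetical.

```python
import numpy as np

def replacing_trace(e, phi, gamma, lam):
    """Replacing-trace update for binary features (entries of phi are 0 or 1)."""
    decayed = gamma * lam * e
    # Active features have their trace reset to 1; inactive ones only decay.
    return np.where(phi == 1, 1.0, decayed)
```

With an accumulating trace, a feature that is active on many consecutive steps can build up a trace well above 1, whereas here the trace of an active feature is always exactly 1.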


When TD(λ) uses this equation to update its eligibility-trace vector, it is said to use replacing traces; in contrast, the default implementation based on trace update (2) is said to use accumulating traces. In this paper, we will indicate these implementations by ‘replace TD(λ)’ and ‘accumulate TD(λ)’, respectively. In our experiments, we compare against both versions.


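3.2 True Online TD(λ)

The true online TD(λ) update equations are:

$$\delta_t = R_{t+1} + \gamma\, \theta_t^\top \phi_{t+1} - \theta_t^\top \phi_t \qquad (4)$$

$$e_t = \gamma \lambda\, e_{t-1} + \phi_t - \alpha \gamma \lambda\, [e_{t-1}^\top \phi_t]\, \phi_t \qquad (5)$$

$$\theta_{t+1} = \theta_t + \alpha\, \delta_t\, e_t + \alpha\, [\theta_t^\top \phi_t - \theta_{t-1}^\top \phi_t]\, [e_t - \phi_t] \qquad (6)$$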


for t ≥ 0, and with $e_{-1} = 0$. Compared to accumulate TD(λ), both the trace update and the weight update have an additional term. We call a trace updated in this way a dutch trace; we call the term $\alpha\, [\theta_t^\top \phi_t - \theta_{t-1}^\top \phi_t]\, [e_t - \phi_t]$ the TD-error time-step correction, or simply the δ-correction. For λ = 0, $e_t$ reduces to $\phi_t$, in which case the δ-correction is 0. Hence, for λ = 0, true online TD(λ) reduces to the regular TD(0) method.

Algorithm 1 shows pseudocode that implements true online TD(λ).¹ In order to discuss its computational cost, let n be the total number of features, and m the number of features with a non-zero value. Then, the number of basic operations (additions and multiplications) per time step for conventional TD(λ) is 3n + 5m. True online TD(λ) takes another 6m, resulting in 3n + 11m operations in total.² Hence, if sparse feature vectors are used (that is, if m ≪ n), the computational overhead of true online TD(λ) is minimal. If non-sparse feature vectors are used (that is, m = n), TD(λ) and true online TD(λ) require 8n and 14n operations, respectively. So in this case, true online TD(λ) is roughly twice as expensive as conventional TD(λ).

1. In the pseudocode, V_old is initialized to 0, but any (finite) value would do, because for the first update of an episode, $e_t$ is equal to $\phi_t$, and hence the δ-correction is equal to zero, regardless of the value of V_old.

2. Note that computing and adding the vector $\alpha \gamma \lambda\, (e^\top \phi)\, \phi$ requires only 4m operations.
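To put these counts in perspective, the worked example below plugs in sizes that match the hashed tile-coded representation of Section 4.3 (n = 200,000 features, m = 9 non-zero features). It assumes that hash collisions do not reduce the number of active features, and it uses only the operation counts stated above.

```latex
% Illustrative sizes: n = 200,000 features, m = 9 non-zero features.
\begin{align*}
\text{TD}(\lambda):\quad & 3n + 5m = 600{,}000 + 45 = 600{,}045 \\
\text{true online TD}(\lambda):\quad & 3n + 11m = 600{,}000 + 99 = 600{,}099 \\
\text{relative overhead}:\quad & \frac{6m}{3n + 5m} = \frac{54}{600{,}045} \approx 0.009\% \\
\text{non-sparse case } (m = n):\quad & \frac{14n}{8n} = 1.75
\end{align*}
```

Under these assumptions the extra cost in the sparse case is negligible, while in the non-sparse case the ratio of 1.75 is what the text rounds to "roughly twice".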


Algorithm 1 true online TD(λ)

INPUT: α, λ, γ, θ_init
θ ← θ_init, V_old ← 0
Loop (over episodes):
    obtain initial φ
    e ← 0
    While terminal state has not been reached, do:
        obtain next feature vector φ' and reward R
        V ← θ^T φ
        V' ← θ^T φ'
        δ ← R + γ V' − V
        e ← γλ e + φ − αγλ (e^T φ) φ
        θ ← θ + α(δ + V − V_old) e − α(V − V_old) φ
        V_old ← V'
        φ ← φ'
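The pseudocode of Algorithm 1 can be sketched in Python as follows. This is an illustrative transcription rather than the authors' implementation: the function name `true_online_td`, the layout of `episodes` (a pre-recorded list of feature vectors and rewards), and the use of a zero feature vector for terminal states are all assumptions made for this example.

```python
import numpy as np

def true_online_td(episodes, alpha, lam, gamma, theta_init):
    """Run true online TD(lambda) over recorded episodes (sketch of Algorithm 1).

    `episodes` is an iterable of (phi0, transitions) pairs, where
    `transitions` is a list of (phi_next, reward) tuples. Terminal states
    are assumed to have an all-zero feature vector.
    """
    theta = np.array(theta_init, dtype=float)
    v_old = 0.0
    for phi0, transitions in episodes:
        phi = np.asarray(phi0, dtype=float)
        e = np.zeros_like(theta)
        for phi_next, reward in transitions:
            phi_next = np.asarray(phi_next, dtype=float)
            v = theta @ phi
            v_next = theta @ phi_next
            delta = reward + gamma * v_next - v
            # Dutch trace: the right-hand side uses the previous trace e
            e = gamma * lam * e + phi - alpha * gamma * lam * (e @ phi) * phi
            # Weight update including the delta-correction terms
            theta = theta + alpha * (delta + v - v_old) * e - alpha * (v - v_old) * phi
            v_old = v_next
            phi = phi_next
    return theta
```

As a usage example, a single always-active feature that is revisited twice before termination with a final reward of 1 can be encoded as `([1.0], [([1.0], 0.0), ([1.0], 0.0), ([0.0], 1.0)])`; with α = 1, λ = 1 and γ = 1 the weight equals the observed return after this one episode.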


4. Empirical Comparisons 

In this section, we compare the performance of true online TD(λ) with that of accumulate TD(λ) and replace TD(λ). First, we compare the behaviour on two challenging examples, then on random MRPs, and finally on a real-world data set. For the experiments on the random MRPs and the real-world data set, all methods use the same sample sequence (and they start from the same initial values). Hence, a lower error corresponds with a higher learning speed.

4.1 Challenging Examples 

For our first experiments, we designed two small examples that are challenging for either accumulate TD(λ) or replace TD(λ) to check if true online TD(λ) can deal well with such problems (see Figure 1). The first one, a one-state example, is challenging for accumulate TD(λ), while the second one, a two-state example, is challenging for replace TD(λ). The challenging part of the one-state example is that the same feature is revisited frequently within the same episode; the challenging part of the two-state example is that the value function cannot be represented exactly (because the two states have different values, but are represented in the same way).

[Figure 1 graphic omitted; labels recoverable from the original graphic: r = 0, p = 0.8, φ = 1, φ = 0.]

Figure 1: Left: One-state example. Right: Two-state example. Circles indicate states; squares are terminal states; arrows indicate state transitions (p is the transition probability). In both examples, the state-space is represented by a single, binary feature φ.

The left graph of Figure 2 shows the early learning performance on the one-state example at λ = 1 and for different step-size values. Replace TD(λ) and true online TD(λ) do very well on this task. For α = 1, the state value already converges after a single episode, resulting in an average RMS error of zero. This is not surprising, given that the return has zero variance. However, the error for accumulate TD(λ) diverges at these settings, and even with an optimized step-size, the RMS error does not reach zero.

The right graph of Figure 2 shows the RMS error after approximate convergence for different λ values on the two-state example. The RMS error for the least mean squares (LMS) solution is also shown. For accumulate TD(λ) the error is equal to that of the LMS solution for λ = 1, but gets worse for smaller λ. This agrees with the theory, which states that the fixed point of accumulate TD(λ) equals the LMS solution for λ = 1, but is different from the LMS solution for λ < 1 (Dayan, 1992). True online TD(λ) has the same behaviour. Surprisingly, for replace TD(λ) the value of λ has no effect. This task illustrates the main weakness of replace TD(λ): while it avoids the divergence issues of accumulate TD(λ), it is an overly conservative approach. It avoids divergence by resetting a trace each time a feature is revisited, which reduces the overall effect of a trace. In the extreme case, the effect can be removed completely, which is what happens here.
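The absence of any λ-dependence for replace TD(λ) can be checked directly from the replacing-trace update. The short derivation below assumes, as the description of Figure 1 suggests, that the single binary feature equals 1 in both non-terminal states; the scalar notation for θ and e is adopted only for this illustration.

```latex
% Single binary feature with \phi(S_t) = 1 in every non-terminal state (assumed).
\begin{align*}
e_t &= 1 \quad \text{for all } t, \text{ because the case } \phi(S_t) = 1 \text{ always applies,} \\
\theta_{t+1} &= \theta_t + \alpha\, \delta_t\, e_t = \theta_t + \alpha\, \delta_t .
\end{align*}
```

The resulting update no longer contains λ and coincides with the TD(0) update, which is consistent with the statement that λ has no effect for replace TD(λ) on this task.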

Overall, these tasks show that while accumulate TD(λ) and replace TD(λ) each have their weakness, true online TD(λ) does not suffer from these weaknesses. Of course, the problems we constructed here are extreme examples. In practice, tasks will not be so one-sided, but they can have properties from both examples.




Figure 2: Left: RMS error at the end of an episode, averaged over the first 10 episodes, on the one-state example (using λ = 1). Right: RMS error after approximate convergence on the two-state example (using α = 0.01). We considered values to be converged if the error changed less than 1% over the last 100 time steps.


4.2 Random MRPs


For our second series of experiments we used randomly constructed MRPs.³ We represent a random MRP as a 3-tuple (k, b, σ), consisting of k, the number of states; b, the branching factor (that is, the number of possible next states per transition); and σ, the standard deviation of the reward (the expected value is drawn at random from a normal distribution with zero mean). We compared the performance on three different MRPs: one with a small number of states, (10, 3, 0.1), one with a larger number of states, (100, 10, 0.1), and one with a low branching factor and no stochasticity in the reward, (100, 3, 0). We evaluated each MRP using three different representations: one with tabular features, one with binary features and one with non-binary, normalized features (normalized such that the length of the feature vector is always 1). For more details on the representations, see Appendix A. For each domain/representation/method combination we performed a scan over α and λ values to determine the best performance for each combination. As performance metric we used the mean-squared error (MSE) with respect to the LMS solution during early learning (for k = 10, we averaged over the first 100 time steps; for k = 100, we averaged over the first 1000 time steps). We normalized this error by dividing it by the MSE obtained for λ = 0 for the relevant domain/representation combination (all three TD(λ) methods reduce to the same algorithm for λ = 0). In addition, we averaged over 50 independent runs. For each domain/representation/method we used the same (randomly generated) sample sequences.⁴

Figure 3 shows the results of the comparisons for each domain/representation combination. Because λ = 0 lies in the parameter range that is being optimized over, the normalized error can never be higher than 1. If for a method/domain the normalized error is equal to 1, this means that setting λ higher than 0 either has no effect, or that the error gets worse. In either case, eligibility traces are not effective for that domain/representation/method combination.
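The normalization can be made concrete with a small sketch. It reflects one reading of the procedure: the numerator is the lowest error over the full (α, λ) scan and the denominator is the lowest error among the λ = 0 settings. The dictionary layout and the function name `normalized_best_error` are hypothetical and chosen only for this illustration.

```python
def normalized_best_error(errors):
    """Normalized best error for one domain/representation/method combination.

    `errors` maps (alpha, lam) pairs to the run-averaged MSE at that
    setting (a hypothetical layout assumed for this sketch).
    """
    best_overall = min(errors.values())
    # Best error restricted to lambda = 0, i.e. without eligibility traces
    best_at_zero = min(err for (alpha, lam), err in errors.items() if lam == 0)
    return best_overall / best_at_zero
```

Because the λ = 0 settings are part of the scan, the numerator can never exceed the denominator, so the function returns a value of at most 1; a value of exactly 1 means no λ > 0 setting improved on λ = 0.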



[Figure 3 graphic omitted; each panel groups results by representation: tabular, binary, normal.]

Figure 3: Normalized MSE at the best parameter settings for all MRP experiments. The error is normalized by dividing it by the MSE obtained for λ = 0.


3. The process we used to construct these MRPs is based on the process used by Bhatnagar, Sutton, Ghavamzadeh and Lee (2009).

4. The code for the MRP experiments is published online at: https://github.com/armahmood/totd-rndmdp-experiments

The results confirm the strength of true online TD(λ). The optimal performance of true online TD(λ) is, on all domains and for all representations, at least as good as the optimal performance of replace TD(λ) and of accumulate TD(λ). Specifically, true online TD(λ) outperforms conventional TD(λ) on 5 of the 9 domains/representations considered. In the next subsection, we compare the methods using real-world data.


4.3 Predicting Signals from a Myoelectric Prosthetic Arm 


In this experiment, we compare the performance of true online TD(λ) and conventional TD(λ) on a real-world data set consisting of sensorimotor signals measured during the human control of an electromechanical robot arm. The source of the data is a series of manipulation tasks performed by a participant with an amputation, as presented by Pilarski et al. (2013). In this study, an amputee participant used signals recorded from the muscles of their residual limb to control a robot arm with multiple degrees of freedom (Figure 4, left). Interactions of this kind are known as myoelectric control (cf. Parker et al., 2006).


For consistency and comparison of results, we used the same source data and prediction learning architecture as published in Pilarski et al. (2013). In total, two signals are predicted: grip force and motor angle signals from the robot’s hand. Specifically, the target for the prediction is a discounted sum of each signal over time, similar to return predictions (cf. general value functions and nexting; Sutton et al., 2011; Modayil et al., 2014). Where possible, we used the same implementation and code base as Pilarski et al. (2013). Data for this experiment consisted of 58,000 time steps of recorded sensorimotor information, sampled at 40 Hz (i.e., approximately 25 minutes of experimental data). The state space consisted of a tile-coded representation of the robot gripper’s position, velocity, recorded gripping force, and two muscle contraction signals from the human user. A standard implementation of tile coding was used, with ten bins per signal, eight overlapping tilings, and a single active bias unit. This results in a state space with 800,001 features, 9 of which were active at any given time. Hashing was used to reduce this space down to a vector of 200,000 features that are then presented to the learning system. All signals were normalized between 0 and 1 before being provided to the function approximation routine. The discount factor for predictions of both force and angle was γ = 0.97, as in the results presented by Pilarski et al. (2013). Parameter sweeps over λ and α were conducted for all three methods. The performance metric is the mean absolute return error over all 58,000 time steps of learning, normalized by dividing by the error for λ = 0.
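The reported feature counts can be reproduced with a short calculation. It assumes that each tiling is a joint grid over all five input signals (position, velocity, gripping force, and the two muscle signals); the text does not state this explicitly, but the assumption yields exactly the numbers given above.

```latex
\begin{align*}
\text{tiles per tiling} &= 10^5 = 100{,}000 && \text{(five signals, ten bins each)} \\
\text{tile features} &= 8 \times 100{,}000 = 800{,}000 && \text{(eight tilings)} \\
\text{total features} &= 800{,}000 + 1 = 800{,}001 && \text{(plus the bias unit)} \\
\text{active features} &= 8 + 1 = 9 && \text{(one tile per tiling, plus bias)} \\
\text{duration} &= 58{,}000 / 40\ \text{Hz} = 1450\ \text{s} \approx 24.2\ \text{min}
\end{align*}
```

The duration agrees with the "approximately 25 minutes" quoted for the data set.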


Figure 4 shows the performance for the angle as well as the force predictions. The relative performance of replace TD(λ) and accumulate TD(λ) depends on the predictive question being asked. For predicting the robot’s grip force signal, a signal with small magnitude and rapid changes, replace TD(λ) is better than accumulate TD(λ) at all non-zero λ values. However, for predicting the robot’s hand actuator position, a smoothly changing signal that varies over a range of roughly 300 to 500, accumulate TD(λ) dominates replace TD(λ) over all non-zero λ values. True online TD(λ) dominates both methods for all non-zero λ values on both prediction tasks (force and angle).

Figure 4: Left: picture of experimental setup. Middle: normalized error of the predictions for different λ at the best α value, for the force predictions. Right: same, but for angle predictions.


5. Conclusions 

We have compared true online TD(λ) to conventional TD(λ) along three broad dimensions: computational cost, learning speed, and ease of use. In terms of computational cost, TD(λ) has a slight advantage. In the worst case, true online TD(λ) is twice as expensive. In the typical case of sparse features, it is only fractionally more expensive than TD(λ). Memory requirements are the same for both methods. In terms of learning speed, in our experiments true online TD(λ) was usually better and never worse than TD(λ). Specifically, true online TD(λ) substantially outperformed TD(λ) on 5 out of the 9 MRPs and in both myoelectric-arm experiments. Finally, in terms of ease of use, we conclude that true online TD(λ) has a clear advantage. The first difficulty with conventional TD(λ) is that typically one must choose between its two types of traces, whereas with true online TD(λ) no such choice has to be made. A second difficulty for accumulate TD(λ) is that its performance can be very sensitive to the step-size parameter (e.g., see Figure 2, left), making it hard to find an acceptable value. Overall, our results suggest that true online TD(λ) should be the first choice when looking for an efficient, general-purpose TD method.


Appendix A. Details of MRP Experiments 

Let k be the number of states in a domain. For the tabular representation each state is 
represented with a unique standard-basis vector of k dimensions. The binary representation 
is constructed by first assigning indices, from 1 to k, to all states. Then, the binary encoding 
of the index of a state is used as a feature vector to represent that state. The length of a 
feature vector is determined by the total number of states: for k = 10, the length is 4; for 
k = 100, the length is 7. As an example, for k = 10 the feature vectors of states 1, 2 and 
3 are (0,0,0,1), (0,0,1,0) and (0,0,1,1), respectively. Finally, for the non-binary, normal 
representation each state is mapped to a 5-dimensional feature vector, with the value of 
each feature drawn from a normal distribution with zero mean and unit variance. After all 
the feature values for a state are drawn, they are normalized such that the feature vector 
has unit length. Once generated, the feature vectors are kept fixed for each state. 
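The three representations can be generated with a few lines of NumPy. The sketch below follows the description above; the function names and the `seed` argument are hypothetical, and row i − 1 of each returned matrix holds the feature vector of state i.

```python
import numpy as np

def tabular_features(k):
    """One standard-basis vector of k dimensions per state."""
    return np.eye(k)

def binary_features(k):
    """Binary encoding of the state indices 1..k, most significant bit first."""
    length = int(k).bit_length()  # 4 for k = 10, 7 for k = 100
    return np.array(
        [[(index >> bit) & 1 for bit in reversed(range(length))]
         for index in range(1, k + 1)],
        dtype=float,
    )

def normal_features(k, dim=5, seed=0):
    """Unit-length feature vectors with normally distributed directions."""
    rng = np.random.default_rng(seed)
    phi = rng.standard_normal((k, dim))  # zero mean, unit variance
    # Normalize every row to unit Euclidean length
    return phi / np.linalg.norm(phi, axis=1, keepdims=True)
```

For k = 10, the first three rows of `binary_features(10)` are (0,0,0,1), (0,0,1,0) and (0,0,1,1), matching the example in the text. Generating the matrices once and reusing them keeps the feature vectors fixed for each state.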